
IDENTIFICATION OF CLUSTERS OF BIALLELIC POLYMORPHIC SEQUENCE-TAGGED SITES
(PSTSS) THAT GENERATE HIGHLY INFORMATIVE AND AUTOMATABLE MARKERS FOR GENETIC
LINKAGE MAPPING

NICKERSON D A; WHITEHURST C; BOYSEN C; CHARMLEY P; KAISER R; HOOD L
DIV. BIOL., 139-74, CALIF. INST. TECHNOL, PASADENA, CALIF. 91125. 
GENOMICS 12 (2). 1992. 377-387. CODEN: GNMCE 
Full Journal Title: Genomics 
Language: ENG 






GENOMICS 12, 377-387 (1992) 



Identification of Clusters of Biallelic Polymorphic Sequence-Tagged 
Sites (pSTSs) That Generate Highly Informative and Automatable 
Markers for Genetic Linkage Mapping 

Deborah A. Nickerson, Charles Whitehurst, Cecilie Boysen, 
Patrick Charmley, Robert Kaiser, and Leroy Hood 

Division of Biology, 139-74, California Institute of Technology, Pasadena, California 91125
Received August 12, 1991; revised October 15, 1991 



Using a combination of denaturing gradient gel electrophoresis and
direct DNA sequencing, we have found that multiple (4 to 7) biallelic
sequence polymorphisms can be located within short DNA segments, 300 to
2400 bp. Here, we report on the identification of three clusters of DNA
polymorphisms, one in each of the constant regions of the human T cell
receptor α and β gene complexes on human chromosomes 14 and 7,
respectively, and a third among the human t-RNA genes on human
chromosome 14. The frequency of these polymorphisms and the extent of
linkage disequilibrium between individual polymorphisms have been
determined using a semiautomated DNA typing system combining DNA target
amplification by the polymerase chain reaction with the analysis of
internal sequence polymorphisms by a colorimetric oligonucleotide
ligation assay. We have found that individual biallelic polymorphisms in
each cluster are often in partial linkage disequilibrium with one
another. This partial linkage disequilibrium permits the combined use of
three to four markers in a cluster to generate a haplotype with high
levels of heterozygosity, 71 to 88%. Therefore, clusters of physically
linked biallelic polymorphisms provide an automatable and highly
informative type of genetic marker for general linkage analysis as well
as an attractive alternative marker system for fine-point mapping of
disease-causing genes and phenotypic traits relative to their framework
locations in the genome. © 1992 Academic Press, Inc.



INTRODUCTION 

The identification and detection of DNA sequence polymorphisms play a
fundamental role in understanding genome structure and function through
genetic linkage mapping (Botstein et al., 1980; Donis-Keller et al.,
1987). Many types of sequence polymorphisms are present in the genome
and can be employed in genetic linkage analysis. Two major types of DNA
polymorphisms stem from variations in the number of repeat units, i.e.,
the simple dinucleotide (CA)n·(GT)n repeats (Weber and May, 1989), and
the more complex variable number tandem repeats (VNTRs) (Jeffreys et
al., 1986; Nakamura et al., 1987). These polymorphisms frequently have
multiple alleles in a population. Therefore, they have a high
probability of occurring in different forms on the two copies of any
given chromosome within a single individual, which makes them highly
informative markers for genetic linkage mapping. Another major type of
DNA polymorphism comes from discrete changes in a specific DNA sequence,
i.e., single nucleotide substitutions (Botstein et al., 1980). These
polymorphisms are the most frequent and widely distributed type of
sequence variation in the genome and are usually biallelic in the
population. Individual biallelic polymorphisms are usually not as
informative as polymorphic repeats. However, multiple closely linked
markers can be combined into haplotypes that can be as informative as a
repeat polymorphism (Donis-Keller et al., 1986).

The identification and analysis of DNA polymorphisms has been greatly
facilitated by the development of the polymerase chain reaction (PCR), a
method that rapidly and exponentially amplifies the specific target
sequences located between two oligonucleotide primers (Saiki et al.,
1988). PCR amplification is rapidly changing the way in which DNA
analysis is performed through the development of genetic mapping
strategies based on polymorphic sequence-tagged sites (pSTSs). A pSTS is
any unique but short genomic sequence amplified by PCR (Olson et al.,
1989) that also contains an identified sequence polymorphism. Most types
of DNA polymorphisms, including simple nucleotide substitutions (Feldman
et al., 1988), dinucleotide repeats (Weber and May, 1989), and complex
VNTR repeats (Jeffreys et al., 1988), can be obtained as pSTSs. However,
the analysis of pSTSs, much like restriction fragment length
polymorphisms (RFLPs), still relies heavily on gel electrophoresis. In
addition to limiting sample throughput and the automation potential of
DNA analysis, gel electrophoresis also has several disadvantages
stemming from the need for appropriate internal standards and
band-matching criteria to compensate for distortion problems, such as
band shifting (Lander, 1991). Furthermore, the analysis of amplified DNA
products by gel electrophoresis can often be complicated by the presence
of artifact bands (Dracopoli and Meisler, 1990). This is particularly a
problem with sequences containing repeats, where recombinant products
can form during the amplification process (Meyerhans et al., 1990).
These problems make electrophoretic analysis difficult to interpret by a
computer. Therefore, the development of automated and easily interpreted
systems for the analysis of size variant pSTSs may be difficult. In
contrast, systems for analyzing simple nucleotide substitutions, i.e.,
biallelic pSTSs, can be automated and easily interpreted by a computer,
e.g., PCR combined with the oligonucleotide ligation assay (Nickerson et
al., 1990). The current major limitation of biallelic pSTSs is that
individually these markers are not highly informative. In the present
study, we demonstrate an interesting phenomenon related to the
informativeness of multiple biallelic pSTSs. We have found that clusters
of biallelic polymorphisms in partial linkage disequilibrium are located
within short DNA segments, 300 to 2400 bp, and can be used to generate
automatable and highly informative haplotypes with heterozygosities of
71 to 88%.



MATERIALS AND METHODS 

Oligonucleotides. Amplification primers and ligation probes were
assembled using standard phosphoramidite chemistry on an Applied
Biosystems 380A DNA synthesizer. The sequences of all amplification
primers and ligation probes are shown in Table 1. Each of the 5'
allele-specific ligation probes was biotinylated (B, Table 1) as
previously described by Landegren et al. (1988). The adjoining 3'
reporter probes for the ligation assay were phosphorylated (P, Table 1)
using "5'-phosphate-on" (Clontech) according to the manufacturer's
instructions. All ligation probes were purified by reverse-phase HPLC
and the phosphorylated 3' reporter probes enzymatically labeled with
digoxigenin (D, Table 1) as previously described (Nickerson et al.,
1990).

DNA amplification. Human genomic DNA samples were amplified from a set
of 10 unrelated Caucasians, the parents from the 40 CEPH (Centre d'Etude
du Polymorphisme Humaine, Paris) families, or selected CEPH families
kindly provided by Dr. Richard Gatti. Genomic DNA samples (10 to 100 ng
starting template) were mixed with a buffer (10 mM Tris-HCl, pH 8.3,
50 mM KCl, 1.5 mM MgCl2, and 0.001% gelatin) containing 40 μM of each of
the four deoxynucleotide triphosphates (dATP, dGTP, dCTP, and dTTP),
0.1 μM of each of the amplification primers (Table 1), and Taq
polymerase (25 U/ml), and amplified by 40 cycles of 20 s at 93°C, 40 s
at the specified annealing temperature (Table 1), and 1 min 30 s at
72°C. Genomic DNA (10 ng) from the 80 CEPH parents was amplified for OLA
analysis in 96-well microtiter plates (MJ Research, Watertown, MA).
Amplification reactions were assembled as previously described
(Nickerson et al., 1990) using a robotic workstation (Biomek 1000,
Beckman Instruments, Palo Alto, CA).

Denaturing gradient gel electrophoresis. The analysis of
non-GC-clamped, amplified DNA samples by denaturing gradient gel
electrophoresis (DGGE) was performed as described by Sheffield et al.
(1989) using an electrophoretic apparatus obtained from Green Mountain
Supplies (Waltham, MA). Each gel was composed of 7% acrylamide (37.5:1
acrylamide:bisacrylamide) and a gradient of denaturants (100%
denaturant = 7 M urea and 40% (v/v) formamide) prepared in the apparatus
to run either parallel or perpendicular to the electrophoretic field.
Amplified DNA samples were denatured for 5 min at 100°C and cooled to
room temperature for 10 min prior to gel loading to permit the formation
of heteroduplexes. Amplified samples were electrophoresed for 9 h at
150 V for parallel gels and 6 h at 150 V for perpendicular gels.
Following electrophoresis the gels were stained with ethidium bromide
and photographed under UV transillumination. The midpoint melting
temperatures (Tm) for DNA regions of interest were calculated using the
Melt87 computer program generously provided by L. S. Lerman (Lerman and
Silverstein, 1987) and melting maps generated by importing these data
into a commercial plotting program (SigmaPlot, Jandel Scientific, Corte
Madera, CA).

DNA sequencing. Amplified DNA samples were purified by electrophoresis
in a 1% low melt agarose gel, and the target band excised and sequenced
directly from the agarose plug using the amplification primers also as
sequencing primers as described by Kretz et al. (1989).

Oligonucleotide ligation assay. Oligonucleotide ligation assays (OLA)
for each of the described sequence polymorphisms were performed using a
robotic workstation as detailed in Nickerson et al. (1990). Briefly, OLA
employs two short (15- to 25-mers) adjacent oligonucleotide probes in
the analysis of biallelic pSTSs. For each polymorphism, three
oligonucleotides are synthesized, two biotinylated probes, one for each
of the allelic forms of the polymorphism with the 3' end of these probes
positioned on the polymorphic nucleotide, and one adjacent 3' reporter
oligonucleotide probe common to both alleles and labeled with
digoxigenin (Table 1). Analysis of an amplified DNA sample is performed
using two separate ligation reactions. Each reaction contains one of the
5' biotinylated probes, the common 3' reporter probe, an aliquot of the
amplified DNA target, and T4 DNA ligase. When the ligating probes are
hybridized to a perfectly complementary target, T4 ligase can covalently
join the 5' biotinylated probe to the 3' reporter probe. If the probes
are mismatched at their target junction, T4 ligase does not form a
covalent bond between the probes. For assay readout the 5' biotinylated
probe is captured on a streptavidin-coated microtiter plate and an
enzyme-linked immunosorbent assay (ELISA) for digoxigenin reporter is
performed. The presence (target match) or absence (target mismatch) of a
colored product is measured spectrophotometrically at 490 nm using a
microtiter plate reader. Sample genotypes are then determined by a
computer program that calculates the mean absorbances from triplicate
ligation reactions for each allele and then takes a ratio of these means
to call a genotype (Nickerson et al., 1990). Sample genotypes are
transferred to a database program (dBASE III) for calculation of allele
frequencies and calculation of observed haplotype frequencies. Double
heterozygous individuals are excluded in the pairwise determination of
observed haplotype frequencies. Linkage disequilibrium, i.e., the extent
of nonrandom allelic association, between pairs of DNA polymorphisms, is
calculated by using the Q statistic described in detail by Hedrick et
al. (1986). The Q statistic is a χ²-distributed measure of linkage
disequilibrium that sums the differences between the observed haplotype
frequencies and the expected haplotype frequencies (calculated by
assuming random allelic association between pairs of DNA polymorphisms).
The χ² probabilities are calculated using one degree of freedom.
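
The genotype-calling step described above (mean absorbance of triplicate
ligation reactions per allele, then a ratio of the two means) can be
illustrated with a minimal sketch. The function name, the ratio cutoff
`hom_ratio`, and the `background` floor are hypothetical choices made
here for illustration only; the cutoffs actually used by the program of
Nickerson et al. (1990) are not given in this text.

```python
from statistics import mean


def call_genotype(abs_allele1, abs_allele2, hom_ratio=4.0, background=0.1):
    """Call a biallelic genotype from replicate OLA absorbances (490 nm).

    abs_allele1 / abs_allele2: absorbance readings for the two
    allele-specific ligation reactions (triplicates in the paper).
    hom_ratio and background are illustrative assumptions, not values
    taken from the paper.
    """
    m1 = mean(abs_allele1)
    m2 = mean(abs_allele2)

    # Neither reaction produced a signal above background: no call.
    if max(m1, m2) < background:
        return "no call"

    # Ratio of the two means; a small floor avoids division by zero.
    floor = 1e-6
    ratio = max(m1, floor) / max(m2, floor)

    if ratio >= hom_ratio:
        return "1/1"  # only allele 1 ligates: homozygous allele 1
    if ratio <= 1.0 / hom_ratio:
        return "2/2"  # only allele 2 ligates: homozygous allele 2
    return "1/2"      # both alleles ligate: heterozygous


if __name__ == "__main__":
    print(call_genotype([1.2, 1.1, 1.3], [0.05, 0.04, 0.06]))
    print(call_genotype([0.9, 1.0, 1.1], [1.0, 0.9, 1.1]))
```

Using a ratio rather than absolute absorbances makes the call less
sensitive to well-to-well differences in the amount of amplified target.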



RESULTS 

A Highly Informative Cluster of DNA Polymorphisms Is
Present in the T Cell Receptor α Chain Constant
Region (Cα)

Our initial strategy for identifying DNA polymorphisms in the T cell
receptor (TCR) constant region was to amplify a specific DNA segment
from 10 unrelated human DNA samples and to scan these for sequence
polymorphisms using parallel denaturing gradient gels. Using this
strategy, we examined a 466-bp region located in the third intron of the
Cα gene (Yoshikai et al., 1985) and discovered a highly polymorphic and
informative melting pattern consisting of four homozygous melting
variants in addition to a number of heterozygous combinations of the
four homozygous variants. A parallel denaturing gradient gel of these
four homozygous variants and the six possible heterozygous melting
variants is shown in Fig. 1A. These 10 different combinations seem to
represent the major forms of polymorphism for this region since we have
been unable to identify any other banding patterns with the analysis of
50 additional DNA samples (CEPH parents, data not shown). The
inheritance of these Cα melting variants through extended families
(CEPH) has also been examined by parallel DGGE. Each variant was
inherited in a Mendelian fashion (Fig. 1B).

FIG. 1. DGGE analysis of an amplified Cα gene segment from intron 3. (A)
An ethidium bromide-stained parallel denaturing gradient gel ([illegible]
to 80% gradient) showing the four homozygous variants (1,1; 2,2; 3,3;
4,4) and the six heterozygous variants (1,2; 2,3; 3,4; 1,3; 2,4; 1,4)
generated by the [illegible] pairing of the four homozygous forms. (B)
The pedigree and DGGE genotypes of [illegible] analyzed on a parallel
denaturing gradient gel (0 to 80% gradient) using amplified DNA products
obtained from each of the family members.

Analysis of the amplified Cα segment by perpendicular DGGE revealed a
two-domain melting structure (Fig. 2A). The presence of two melting
domains was also predicted by the construction of a melting map for this
Cα region and suggested the presence of a 5' 150-bp high temperature
domain and a lower temperature domain for the remaining downstream (3')
sequence (300 bp, Fig. 2B). Based on these domain properties and the
ability of DGGE to detect polymorphisms located in only the lower
temperature domains (Fischer and Lerman, 1983; Myers et al., 1987), we
predicted that the sequence change or changes responsible for the
different melting variants would likely reside within the 300-bp lower
temperature domain. To further evaluate this, we amplified and sequenced
DNA samples from each of the homozygous melting variants (Fig. 1A). Upon
comparison of these sequences we found that four biallelic nucleotide
variations, Cα1 through Cα4, were responsible for generating these
different melting variants. The nucleotide substitutions located at each
of these polymorphic sites are shown in Fig. 3. As shown by the melting
map illustrated in Fig. 2B, the most 5' sequence polymorphism, Cα1, was
located in the highest temperature domain while the remaining
polymorphisms were located in the lower temperature domain. Because of
its location in the higher temperature domain, we suspected that the Cα1
polymorphism was not responsible for any of the mobility shifts detected
by DGGE. This was later confirmed by the observation that DNA samples
heterozygous at Cα1, but homozygous at Cα2, Cα3, and Cα4, form a single
homozygous band on a parallel denaturing gradient gel (data not shown).
In contrast to Cα1, the other sequence polymorphisms (located in the
lower temperature domain) did appear to affect the relative mobility
pattern of the Cα segment when analyzed by parallel DGGE. More
importantly, the relative migration patterns for each of these melting
variants in the denaturing gel corresponded directly to the number of GC
versus AT basepairs at the three downstream polymorphic sites. Note that
variants having more GC basepairs at these polymorphic sites migrated
further into the gel (compare Figs. 1A and 3). Therefore, we surmise
that each replacement of an AT with a GC basepair served to further
stabilize the lower temperature domain, resulting in a slightly higher
melting temperature and, as a consequence, further migration into the
gel.

FIG. 2. Analysis of the Cα gene segment. (A) An ethidium bromide-stained
perpendicular denaturing gradient gel using amplified DNA products
obtained from a pooled DNA template (equivalent amounts of DNA from 20
unrelated individuals) showing both the domain structure for this region
and the large number of homozygous and heteroduplex alleles that can be
detected in this highly informative region. (B) Calculated melting map
of the amplified Cα fragment showing the relative location of the four
nucleotide substitutions in this gene segment.

To rapidly obtain estimates of the allele frequencies for each of these
polymorphisms in a population, we utilized a nonisotopic semiautomated
method, PCR/OLA (Nickerson et al., 1990), to detect each of the
polymorphic alleles at these four sites. An example of these results
with DNA samples from homozygous and heterozygous individuals is shown
in Fig. 3. Clearly, the PCR/OLA procedure is a highly effective approach
for discriminating each of the allelic forms at these four sites.
Analysis of a population of 80 individuals (the CEPH parents) by PCR/OLA
revealed that all four Cα polymorphisms had genotype distributions
indicating Hardy-Weinberg equilibrium. A pairwise analysis of linkage
disequilibrium (nonrandom association) between these polymorphisms
showed a moderate level of disequilibrium (P = 8.3 × 10⁻³ to 9.8 × 10⁻⁶)
between most polymorphisms with the exception of the Cα1 and Cα3 pair
which revealed a much higher level of linkage disequilibrium in this
population (P = 5.9 × 10⁻¹⁷, Table 2). Although each polymorphism
revealed a moderate degree of heterozygosity among the 80 CEPH parents,
25 to 41% (Table 2), the degree of heterozygosity increased markedly
when these polymorphisms were combined into a haplotype. The combined
heterozygosity of



[Figure 3 image not reproduced. Top panel: homozygous variants 1 to 4.
Bottom panel: nucleotide substitution detected by OLA.]

FIG. 3. Sequence and OLA analysis of the Cα polymorphisms. (Top) A
schematic diagram indicating the nucleotide variations identified by
sequencing amplified DNA samples from each of the four homozygous DGGE
variants. (Bottom) ELISA-based OLA analysis of the four Cα sequence
polymorphisms using amplified DNA samples from homozygous and
heterozygous individuals. OLAs for each allele were performed in
triplicate. Microtiter wells containing the digoxigenin reporter form a
colored product and indicate complete complementarity between the
ligating probes and amplified DNA target. The absence of a colored
product in the microtiter wells indicates a mismatch between the
ligating probes and the amplified DNA target.

TABLE 2

Allele Frequency, Heterozygosity, and Linkage Disequilibrium in the Cα pSTS Cluster

                           Cα1              Cα2                Cα3                Cα4
Nucleotide                 3480             3645               3702               3754
Allele frequency           C: 68%           A: 86%             A: 32%             G: 69%
                           G: 32%           C: 14%             C: 68%             A: 31%
Observed heterozygosity    41%              25%                40%                39%

Linkage disequilibrium
  Cα1                      -                P = 3.7 × 10⁻⁵ ᵃ   P = 5.9 × 10⁻¹⁷    P = 1.2 × 10⁻⁵
  Cα2                      17.8, n = 130 ᵇ  -                  P = 2.6 × 10⁻⁵     P = 8.3 × 10⁻³
  Cα3                      70.0, n = 96     17.7, n = 132      -                  P = 9.8 × 10⁻⁶
  Cα4                      19.1, n = 136    7.0, n = 148       19.6, n = 136      -

a P values for the χ² distribution of the Q statistic.

b Values for Q and the number of chromosomes examined.


the four polymorphisms was 71% by PCR/OLA and was identical to the
heterozygosity obtained by combining only three of the polymorphisms,
Cα1/Cα2/Cα4 or Cα2/Cα3/Cα4.
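
The relation between the Q values and the P values in Table 2 can be
checked with a short sketch. It relies only on the statement in the
Methods that Q is χ²-distributed with one degree of freedom; for one
degree of freedom the upper-tail probability reduces to a complementary
error function. The function name is a hypothetical choice made here,
and the Q values are those read from Table 2, which are rounded, so the
recomputed probabilities should be expected to match the tabulated ones
only approximately.

```python
import math


def chi2_upper_tail_1df(q):
    """P(X > q) for a chi-square variate X with one degree of freedom.

    With 1 df, X is the square of a standard normal variate, so
    P(X > q) = P(|Z| > sqrt(q)) = erfc(sqrt(q / 2)).
    """
    if q < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(q / 2.0))


if __name__ == "__main__":
    # Q values as read from Table 2 (rounded in the source).
    for pair, q in [("Ca1/Ca3", 70.0), ("Ca1/Ca4", 19.1), ("Ca2/Ca4", 7.0)]:
        print(pair, q, chi2_upper_tail_1df(q))
```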

A Cluster of DNA Polymorphisms Located in the TCR β
Constant Region 2 (Cβ2)

To further explore the extent of DNA sequence polymorphisms in the
constant regions of the TCR, we used a strategy similar to that
described for Cα to examine a 483-bp segment from the TCR Cβ2 region
(104 bp from the 5' noncoding region and 379 bp from exon 1; Toyonaga et
al., 1985). A biallelic polymorphism within this Cβ2 region was detected
by analysis of amplified DNA from 10 unrelated individuals on a parallel
denaturing gradient gel. The inheritance of this Cβ2 polymorphism, shown
in Fig. 4, followed a Mendelian pattern. Sequence analysis of amplified
DNA samples from the two homozygous melting variants indicated that a
single nucleotide substitution in the lower melting domain of the Cβ2
sequence was responsible for generating the different melting variants
(Cβ23, a C to T base change, Fig. 5). However, like Cα, the complete
sequence of this segment revealed the presence of another DNA
polymorphism in the highest melting domain (Cβ24, Fig. 5). This
polymorphism resides within the coding region of Cβ2 (exon 1) but does
not lead to any amino acid substitutions. Although DGGE has proven
effective in detecting DNA polymorphisms within the lower melting
domains of these amplified fragments, we have found that direct DNA
sequencing offers increased sensitivity in the detection of DNA
polymorphisms. Furthermore, we have found that direct DNA sequencing of
amplified DNA targets even with multiple individuals (e.g., 6 unrelated
individuals) is more rapid than DGGE analysis since DNA sequencing: (i)
can be performed under a standard set of conditions that offers
considerable time-savings, (ii) will unambiguously determine whether a
common DNA polymorphism (at least 1 in 6 individuals) is present across
the entire scanned region (high or low melting domains), and (iii)
provides the information necessary for high throughput typing and
automated data analysis by approaches like OLA. In contrast, several
types of gels, i.e., perpendicular and/or parallel gels (0 to 80% and 30
to 80% gradients), must be explored to properly scan even the lower
temperature domain of a specific DNA sequence by DGGE. Furthermore, even
when a DNA polymorphism has been detected by DGGE, sufficient
information is not obtained to permit high throughput typing or
analysis. Therefore, on the basis of the ease, sensitivity, and speed of
direct DNA sequencing, we modified our approach to polymorphism
scanning. Using a sequence-based approach, we examined an additional
1900 bp of noncoding sequence from Cβ2. Five additional single
nucleotide variations were identified (Cβ21, Cβ22, Cβ25, Cβ26, and Cβ27,
designated according to their location in Cβ2 5' to 3', Fig. 5). The
majority of these additional sequence polymorphisms (four of five) were
the result of transitional base substitutions (Cβ21, Cβ22, Cβ25, and
Cβ27). Moreover, two of these polymorphisms provide the underlying
sequence variations responsible for two previously reported RFLPs in
Cβ2, the BglII polymorphism (Cβ21; Berliner et al., 1985), and the KpnI
polymorphism (Cβ25; Perl et al., 1989).

FIG. 4. DGGE analysis of the Cβ2 gene segment. The pedigree and DGGE
genotypes of a three generation family analyzed by a parallel denaturing
gradient gel (30 to 80% gradient) with amplified Cβ2 products obtained
from each of the family members.

a. P values for the χ² distribution of Q.

b. Values for Q and the number of chromosomes examined.

FIG. 5. The location, frequency, and linkage disequilibrium among the
biallelic DNA polymorphisms found in the Cβ2 gene segment of the human
TCR.

Using a combination of PCR and OLA (Table 1) we have confirmed that each
of these seven Cβ2 polymorphisms followed a Mendelian inheritance
pattern. Figure 5 shows the location, allele frequencies, and
heterozygosities for these polymorphisms when analyzed by PCR/OLA for 80
DNA samples (the CEPH parents). Six of the seven polymorphisms had
allele frequencies approximating 50:50 while the remaining polymorphism,
Cβ23, revealed an obvious major and minor allele pattern (approximately
1:2). All polymorphisms had genotype distributions that were not
significantly different from those calculated assuming Hardy-Weinberg
equilibrium. Furthermore, we find that these seven polymorphisms are in
varying degrees of linkage disequilibrium with one another (Fig. 5). On
the basis of the moderate levels of disequilibrium present in this
region, a highly informative haplotype can be generated using only four
or five of these markers (heterozygosity of 88% for the
Cβ21/Cβ22/Cβ24/Cβ25 combination and 89% for the Cβ21/Cβ22/Cβ24/Cβ25/Cβ26
combination). As would be predicted, pairs of polymorphisms exhibiting
the highest levels of disequilibrium, e.g., Cβ22/Cβ23, Cβ25/Cβ26,
Cβ25/Cβ27, and Cβ26/Cβ27, contributed the least to the overall
informativeness of a haplotype from this region.
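
Why markers in strong disequilibrium add little to haplotype
informativeness can be seen in a small numerical sketch. It uses the
standard two-locus model in which haplotype frequencies are the products
of the allele frequencies shifted by a disequilibrium coefficient D, and
it computes the expected heterozygosity 1 - Σ hᵢ² under random mating.
This is an illustration under those assumptions, not the calculation
used in the paper, which reports observed heterozygosities; the function
names and the allele frequencies of 0.5 are hypothetical.

```python
def haplotype_frequencies(p_a, p_b, d):
    """Haplotype frequencies for two biallelic sites.

    p_a, p_b: frequency of allele A at site 1 and allele B at site 2.
    d: disequilibrium coefficient D (0 means random association).
    """
    freqs = {
        "AB": p_a * p_b + d,
        "Ab": p_a * (1 - p_b) - d,
        "aB": (1 - p_a) * p_b - d,
        "ab": (1 - p_a) * (1 - p_b) + d,
    }
    if min(freqs.values()) < -1e-12:
        raise ValueError("D is outside the range allowed by the allele frequencies")
    return freqs


def expected_heterozygosity(frequencies):
    """Expected heterozygosity 1 - sum(h_i^2), assuming random mating."""
    return 1.0 - sum(f * f for f in frequencies)


if __name__ == "__main__":
    # Two hypothetical sites, each with allele frequencies of 0.5.
    for d in (0.0, 0.125, 0.25):
        h = expected_heterozygosity(haplotype_frequencies(0.5, 0.5, d).values())
        print(f"D = {d}: haplotype heterozygosity = {h:.4f}")
```

In this toy case a single site has an expected heterozygosity of 0.5;
adding a second site raises it to 0.75 when the sites are independent,
but leaves it at 0.5 when they are in complete disequilibrium.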

A Cluster of Sequence Polymorphisms Is Found in the
Human ProII and Thr t-RNA Genes

Our findings of multiple physically linked but highly informative
clusters of biallelic polymorphisms in the TCR constant regions (Cα and
Cβ2) led us to question whether this finding was a common phenomenon in
the genome or specific to the TCR gene complexes. To assess this, we
selected a known sequence of moderate length (approximately 1300 bp)
surrounding the ProII and Thr t-RNA genes located on human chromosome 14
(Chang et al., 1986). Three sequence-tagged sites (STSs) were generated
from the noncoding sequences surrounding these t-RNA genes (each
approximately 300 to 450 bp in length). Using amplified DNA samples from
six unrelated individuals and our direct DNA sequencing approach, we
detected five single nucleotide substitutions within these STSs, three
transitions and two transversions as shown in Fig. 6. These
polymorphisms have been designated HT1 through HT5 according to their 5'
to 3' location within the sequence (Fig. 6) and were all inherited in a
Mendelian fashion. Each of these polymorphisms showed a moderate level
of heterozygosity (24 to 42%) and had genotype distributions indicating
Hardy-Weinberg equilibrium when tested on a panel of 80 DNA samples (the
CEPH parents) by PCR/OLA. Moreover, like the Cα and Cβ2 polymorphisms,
several of the HT polymorphisms also showed a moderate level of linkage
disequilibrium with one another (Fig. 6). Therefore, three polymorphisms
could be combined to generate a highly informative haplotype for this
region, HT1/HT3/HT4 (combined heterozygosity, 72%); this seems to
indicate that clusters of multiple linked biallelic polymorphisms may be
a common phenomenon in the genome.

a. P values for the χ² distribution of the Q statistic.

b. Values for Q and the number of chromosomes examined.

FIG. 6. The location, frequency, and linkage disequilibrium for the
biallelic DNA polymorphisms identified in DNA sequences surrounding the
human pro II and thr t-RNA genes.
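
The Hardy-Weinberg checks mentioned throughout the Results can be
sketched as follows: estimate the allele frequency from the genotype
counts, derive the expected genotype counts p², 2pq, and q², and compare
observed with expected by a χ² statistic (one degree of freedom for a
biallelic site). The text does not say which test the authors applied,
so this is a generic illustration; the function name and the genotype
counts in the example are hypothetical.

```python
def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Return (p, expected_counts, chi2) for one biallelic site.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    p is the estimated frequency of allele a.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Each aa individual carries two copies of a, each ab individual one.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Skip classes with an expected count of zero to avoid dividing by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi2


if __name__ == "__main__":
    # Hypothetical genotype counts for 80 individuals.
    p, expected, chi2 = hardy_weinberg_chi2(37, 33, 10)
    print(p, expected, chi2)
```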

DISCUSSION 

Single Nucleotide Substitutions Are Predominantly 
Transitions 

We have detected the presence of clusters of sequence polymorphisms in
the human genome using a scanning approach employing DGGE and/or direct
DNA sequencing. Altogether, 16 single nucleotide substitutions were
uncovered, 4 in the TCR Cα region, 7 in the TCR Cβ2 region, and 5 among
the sequences surrounding two human t-RNA genes. All of these
polymorphic substitutions were inherited in a Mendelian fashion and
yielded allele distributions similar to those expected assuming
Hardy-Weinberg equilibrium. Furthermore, these substitutions also
followed the general trends for the types of nucleotide substitutions
that occur within the human genome (Vogel and Kopun, 1977).
Specifically, transitional base changes appear more frequently than
nucleotide transversions. Additionally, transitional base changes
involving C and T are clearly favored over substitutions involving G and
A. Among the nucleotide substitutions found surrounding the human t-RNA
genes, and in the TCR Cα and Cβ2 regions, 10 of the 16 substitutions
(62%) were transitional base changes, and substitutions involving C and
T were the most frequent type of transitional change (70%).
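
The transition/transversion distinction used above can be made concrete
with a minimal sketch: a substitution is a transition when both bases
are purines (A, G) or both are pyrimidines (C, T), and a transversion
otherwise. The function name is hypothetical, and the example allele
pairs are the four Cα substitutions as read from Table 2.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(base1, base2):
    """Classify a single nucleotide substitution.

    Returns "transition" for purine<->purine or pyrimidine<->pyrimidine
    changes, and "transversion" for purine<->pyrimidine changes.
    """
    pair = {base1.upper(), base2.upper()}
    if len(pair) != 2 or not pair <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


if __name__ == "__main__":
    # Allele pairs for Ca1 to Ca4 as read from Table 2.
    c_alpha_sites = [("C", "G"), ("A", "C"), ("A", "C"), ("G", "A")]
    print(Counter(classify_substitution(a, b) for a, b in c_alpha_sites))
```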

DNA Sequence Polymorphisms Are Frequent 

Many estimates of the frequency of polymorphism in the human genome have
been reported. These estimates vary depending on the number of
individuals examined and the technique used to scan the DNA segment, but
they generally range from 1 in every 200 bp to 1 in every 1000 bp
(Cooper et al., 1985; Miyamoto et al., 1988). The overall distribution
of DNA polymorphisms in the three regions we examined was 16 nucleotide
variations in a total of 3755 bp of DNA, or 1 in every 235 bp.
Considering the wide range of variation that may exist in the
distribution of sequence polymorphisms within the human genome, our
observation of a polymorphism distribution of 1 in every 235 bp among
predominantly (90%) noncoding sequences probably represents a fairly
neutral distribution of base substitution (Kimura, 1983).

It is noteworthy that only 5 of the 16 substitutions in 
these three regions would have been identified as 
RFLPs, one of the most common scanning approaches 
for identifying DNA polymorphisms. Even denaturing 
gradient gels (in the absence of a GC-clamp) detected 
only a moderate number of DNA polymorphisms in the 
regions scanned (four of six polymorphisms among the 
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Ca and Q8 segments scanned). This level of sensitivity 
using several variables for the analysis, perpendicular 
and parallel gels, i.e., 0 to 80% and 30 to 80% gels, is 
consistent with other regions we have scanned by direct 
DNA sequencing and DGGE (random chromosome 14 
STSs and mouse TCR variable gene segments, C. Boy- 
sen, M. Toda, and D. A. Nickerson, unpublished obser- 
vations). The most accurate method for identifying 
DNA polymorphisms is of course direct DNA sequence 
analysis. Until recently, however, this type of scanning 
was tedious and mainly restricted to well-defined re- 
gions of the genome (Miyamoto et al., 1988). The speed 
and efficiency of PCR amplification combined with new 
approaches for direct sequence analysis of amplified 
DNA products (Kretz et al., 1989; Rosenthal and Jones, 
1990) provide a rapid and highly accurate system for 
scanning for DNA sequence polymorphisms, as demon- 
strated here and previously by Yandell and Dryja (1989). 
Furthermore, sequence-based approaches to polymor- 
phism scanning can be automated to generate high 
throughput systems for analyzing amplified DNA sam- 
ples from multiple individuals (Wahlberg et al., 1990; 
Wilson et al., 1990). 

DNA Polymorphisms Are Frequently in Partial Linkage 
Disequilibrium 

We have found that DNA polymorphisms within an 
individual cluster, Cα, and t-RNA, were in partial 
linkage disequilibrium with one another. This is similar 
to findings for several other regions of the human ge- 
nome, e.g., the α1-antitrypsin and insulin genes (Cox et 
al., 1985; Chakravarti et al., 1986). In the present study, 
DNA polymorphisms (Cα3 and Cα4) separated by only 
52 bp were found to be in partial linkage disequilibrium. 
Furthermore, it is apparent that some of the polymor- 
phisms within these clusters remain in high levels of 
linkage disequilibrium (Cα1 and Cα3) while others 
found between (Cα2 and Cα4) or on either side of these 
polymorphisms exhibit only partial disequilibrium. 
Many factors could play a role in generating partial link- 
age disequilibrium between physically linked DNA poly- 
morphisms including random genetic drift as well as the 
rate of recombination between these polymorphisms 
over time in a population (Ohta, 1982; Thomson and 
Klitz, 1987). However, among DNA polymorphisms sep- 
arated by such small physical distances (<2000 bp), it is 
possible that the levels of linkage disequilibrium could 
reflect the evolutionary history of these polymorphisms 
over time in the population. For example, DNA poly- 
morphisms arising through a series of sequential muta- 
tional events could give rise to multiple haplotypes in 
the population (Fig. 7). In contrast, pairs of DNA poly- 
morphisms arising more recently, or in tandem, i.e., 
closely timed mutations, might be at higher levels of dis- 
equilibrium with one another in the population (Fig. 7, 
haplotype III). The presence of multiple haplotypes and 
partial disequilibrium in clusters of DNA polymor- 
phisms is clearly a phenomenon that can be exploited for 
the development of alternative forms of highly informa- 
tive markers, i.e., clusters of pSTSs. Furthermore, par- 
tial linkage disequilibrium between DNA polymor- 
phisms in short DNA segments may well be a common 
finding in the genome since we have also detected partial 
disequilibrium among polymorphisms in DNA segments 
obtained by random cloning methods (data not shown). 

FIG. 7. A schematic diagram for the formation of multiple DNA 
haplotypes generated in the population through a series of sequential 
single base mutations (haplotypes II and IV) or through tandem (pairs 
of closely timed or recently generated) mutations (haplotype III) as 
suggested from the analysis of the TCR Cα haplotypes. 
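
To make the distinction between complete and partial linkage disequilibrium concrete, the following minimal sketch computes three standard pairwise measures (D, D', and r²) from the four haplotype frequencies at two biallelic sites. The text does not state which statistic was used in this study, so the choice of measures is an assumption. The haplotype frequencies below are hypothetical and are not data from this work.

```python
def ld_stats(f11, f12, f21, f22):
    """Return (D, D', r^2) for two biallelic sites.

    f11..f22 are haplotype frequencies: the first digit is the allele at
    site 1, the second digit is the allele at site 2. They must sum to 1.
    """
    if abs(f11 + f12 + f21 + f22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p1 = f11 + f12          # frequency of allele 1 at site 1
    q1 = f11 + f21          # frequency of allele 1 at site 2
    d = f11 - p1 * q1       # raw disequilibrium coefficient
    # Largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p1 * (1 - p1) * q1 * (1 - q1)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical example 1: only two haplotypes exist (complete disequilibrium).
complete = ld_stats(0.5, 0.0, 0.0, 0.5)

# Hypothetical example 2: all four haplotypes exist (partial disequilibrium).
partial = ld_stats(0.4, 0.1, 0.1, 0.4)

print("complete:", complete)
print("partial: ", partial)
```

In the second example all four haplotypes are present, so D' and r² both fall below 1. This is the situation described above as partial disequilibrium.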

PCR/OLA Provides a High Throughput Genetic 
Mapping Procedure 

The development of high throughput systems for the 
identification and typing of sequence polymorphisms 
will greatly increase the speed and efficiency of genetic 
linkage mapping as well as the identification of the genes 
encoding traits of interest, such as disease-causing 
genes. In this regard and as our data show, pSTSs con- 
taining simple biallelic nucleotide substitutions offer sig- 
nificant advantages in the rapid development of linkage 
maps. First, single nucleotide substitutions are the most 
common and widely distributed type of polymorphism in 
the genome. Therefore, it is likely that any random STS 
obtained from the genome would have a high probability 
of containing one or more single nucleotide substitu- 
tions depending on the size of the STS. Second, once 
identified, biallelic pSTSs can be typed using a rapid and 
semiautomatable system, PCR/OLA. Finally and more 
importantly, the outcome from the analysis of biallelic 
pSTSs by PCR/OLA is easy to interpret unequivocally 
as a positive or negative by a computer, a feature not 
easily achieved with size variant markers analyzed by gel 
electrophoresis, i.e., repeat polymorphisms. 
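
The ease of computer interpretation can be illustrated with a minimal sketch. It assumes that each sample yields one positive/negative result per allele-specific probe. The function name, genotype labels, and "no call" handling are hypothetical and do not describe the software used in this study.

```python
def call_genotype(allele1_positive: bool, allele2_positive: bool) -> str:
    """Map two positive/negative OLA results to a biallelic genotype call."""
    if allele1_positive and allele2_positive:
        return "1/2"      # both allele-specific probes ligated: heterozygote
    if allele1_positive:
        return "1/1"      # only allele 1 detected: homozygote
    if allele2_positive:
        return "2/2"      # only allele 2 detected: homozygote
    return "no call"      # neither signal: failed reaction, repeat the sample


# Hypothetical results for three samples: (allele 1 signal, allele 2 signal).
samples = {"s1": (True, False), "s2": (True, True), "s3": (False, False)}
calls = {name: call_genotype(a1, a2) for name, (a1, a2) in samples.items()}
print(calls)
```

Each call reduces to two binary readings, so no band sizing or gel interpretation is required.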

Clustered Polymorphisms in pSTSs Are Highly 
Informative 

The most useful genetic markers are those that are 
sufficiently informative to discriminate all four parental 
chromosomes. More informative genetic markers reduce 
the number of meioses required to map a specific trait 
and help to map genetic diseases with limited affected 
pedigrees. In this regard, the use of clusters of simple 
biallelic polymorphisms offers an alternative highly in- 
formative marker system for genetic linkage mapping. 
For example, once a disease-causing gene is mapped to a 
defined framework interval (±5 cM), the identification 
of clusters of highly informative biallelic markers would 
provide an easily attained marker for localizing a dis- 
ease-causing gene(s) mapped to this interval. Addition- 
ally, the concept of identifying clusters of pSTSs can 
also be extended to large fragments of human DNA. For 
example, once a random biallelic pSTS with a single 
base substitution has been identified, the PCR primers 
as well as the OLA probes (both allele-specific probes 
combined with the reporter probe) can be used to rapidly 
screen an ordered array of YAC or cosmid clones to 
identify a larger physical fragment of human DNA con- 
taining this pSTS (P. Kwok and M. Olson, unpublished 
observation). This will provide a physical anchor for 
each pSTS as well as a source of linked DNA sequence to 
develop additional STSs for chromosome walking in a 
defined framework interval or to develop additional 
pSTS markers to generate a highly informative marker 
at a defined location in the genome, i.e., in framework 
locations surrounding a disease-causing gene or pheno- 
typic trait. Therefore, the development of clusters of 
biallelic pSTSs can serve as a means of connecting the 
physical and genetic maps of the genome in addition to 
providing an alternative, automatable, and highly infor- 
mative marker system for genetic linkage mapping. 
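
One way to see why a cluster is more informative than a single biallelic site is to compare expected heterozygosity, H = 1 - Σ f_i², for a single site and for the haplotypes defined by a cluster. The sketch below is an illustration under stated assumptions: heterozygosity is used here as a convenient measure of informativeness, and the frequencies are hypothetical, not values measured in this study.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 - sum(f_i^2), for allele or haplotype frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(f * f for f in freqs)


# A single biallelic site is at most 50% heterozygous (both alleles at 0.5).
single_site = heterozygosity([0.5, 0.5])

# Hypothetical cluster: four haplotypes at equal frequency in the population.
four_haplotypes = heterozygosity([0.25, 0.25, 0.25, 0.25])

print(single_site, four_haplotypes)
```

A single biallelic site cannot exceed a heterozygosity of 0.5. A cluster that resolves several haplotypes raises this ceiling, provided the sites are in only partial disequilibrium so that more than two haplotypes are present in the population.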